Goto

Collaborating Authors

 instruction-response pair



InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages

Keita, Mamadou K., Diarra, Sebastien, Homan, Christopher, Diallo, Seydou

arXiv.org Artificial Intelligence

Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.


T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning

Fu, Yanjun, Hamman, Faisal, Dutta, Sanghamitra

arXiv.org Artificial Intelligence

Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high-quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promotes robust and reliable samples whose neighbors also show high quality with less local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU. Our code is available at https://github.com/Dynamite321/T-SHIRT.


Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning

Pham, Anh, Thalanki, Mihir, Sun, Michael, Chaloo, Aditya, Gupta, Ankita, Xia, Tian, Mate, Aditya, Nosakhare, Ehimwenma, Srinivasan, Soundararajan

arXiv.org Artificial Intelligence

Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.



PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

Yin, Shangjian, Liang, Shining, Ding, Wenbiao, Qian, Yuli, Shi, Zhouxing, Li, Hongzhi, Xie, Yutao

arXiv.org Artificial Intelligence

Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs). However, its effectiveness depends on high-quality instruction data. Most existing alignment datasets are either private or require costly human annotation, which limits reproducibility and scalability. Even with Reinforcement Learning from AI Feedback (RLAIF), concerns about data quality remain. Moreover, it is unclear how much data is actually required to fine-tune a base model into a strong instruction-following model. Current approaches often rely on over 300k examples even at the supervised fine-tuning (SFT) stage, yet they still underperform compared to proprietary models, creating barriers for academic and resource-limited communities. To address this gap, we introduce PiKa, a data-efficient family of expert-level alignment datasets. In particular, the PiKa-SFT dataset uses only 30k SFT examples, far fewer than state-of-the-art datasets like Magpie. Through evaluations by fine-tuning Llama-3-8B-Base on PiKa and other public datasets, we show that PiKa-SFT outperforms models trained on much larger data. On AlpacaEval 2.0 and Arena-Hard benchmarks, PiKa-SFT fine-tuning even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples. We further extend our study by training the Qwen2.5 series (0.5B to 7B) on PiKa-SFT, achieving consistent gains. These findings demonstrate that high-quality alignment can be achieved with significantly less data, offering a scalable path for open-source LLM alignment. Code and data: https://github.com/SJY8460/PiKa.


SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Barati, Iman, Amiri, Mostafa, Faili, Heshaam

arXiv.org Artificial Intelligence

Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high quality instruction datasets for SFT. Our approach begins with a limited set of domain specific, human generated questions, which are systematically expanded using a large language model. Subsequently, domain relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)


Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Zhou, Honglu, Peng, Xiangyu, Kendre, Shrikant, Ryoo, Michael S., Savarese, Silvio, Xiong, Caiming, Niebles, Juan Carlos

arXiv.org Artificial Intelligence

Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. T o bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs. 1


From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Ahmad, Wasi Uddin, Majumdar, Somshubra, Ginsburg, Boris

arXiv.org Artificial Intelligence

Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned \texttt{Qwen2.5-Coder} (base and instruct) models on HumanEval Infilling and SAFIM benchmarks demonstrates improved performances without post-processing, especially when the \emph{middle} consist of complete lines. However, post-processing of the LLM outputs remains necessary when the \emph{middle} is a random span of code.


DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

Wang, Chengyu, Yan, Junbing, Yue, Yuanhao, Huang, Jun

arXiv.org Artificial Intelligence

Enhancing computational efficiency and reducing deployment costs for large language models (LLMs) have become critical challenges in various resource-constrained scenarios. In this work, we present DistilQwen2.5, a family of distilled, lightweight LLMs derived from the public Qwen2.5 models. These distilled models exhibit enhanced instruction-following capabilities compared to the original models based on a series of distillation techniques that incorporate knowledge from much larger LLMs. In our industrial practice, we first leverage powerful proprietary LLMs with varying capacities as multi-agent teachers to select, rewrite, and refine instruction-response pairs that are more suitable for student LLMs to learn. After standard fine-tuning, we further leverage a computationally efficient model fusion approach that enables student models to progressively integrate fine-grained hidden knowledge from their teachers. Experimental evaluations demonstrate that the distilled models possess significantly stronger capabilities than their original checkpoints. Additionally, we present use cases to illustrate the applications of our framework in real-world scenarios. To facilitate practical use, we have released all the DistilQwen2.5 models to the open-source community.